A Framework for Language-Independent Analysis and Prosodic Feature Annotation of Text Corpora

Authors

  • Dimitris Spiliotopoulos
  • Georgios Petasis
  • Georgios Kouroupetroglou
Abstract

Concept-to-Speech systems include Natural Language Generators that produce linguistically enriched text descriptions, which can lead to significantly improved quality of speech synthesis. There are cases, however, where either the generator modules produce pieces of non-analyzed, non-annotated plain text, or such modules are not available at all. Moreover, the language analysis is restricted by the usually limited domain coverage of the generator due to its embedded grammar. This work reports on a language-independent framework, linguistic resources and language analysis procedures (word/sentence identification, part-of-speech tagging, prosodic feature annotation) for annotating and processing plain or enriched text corpora. It aims to produce automated, XML-annotated, enriched prosodic markup for English and Greek texts, for improved synthetic speech. The markup includes information both for training the synthesizer and for actual input to synthesis. Depending on the domain and target, different methods may be used for automatic classification of entities (words, phrases, sentences) into one or more preset categories such as “emphatic event”, “new/old information”, “second argument to verb”, “proper noun phrase”, etc. The prosodic features are classified according to the analysis of the speech-specific characteristics for their role in prosody modelling and passed through to the synthesizer via an extended SOLE-ML description. Evaluation results show that high accuracy is achieved using selectable hybrid methods for part-of-speech tagging. Annotation of a large generated text corpus containing 50% enriched text and 50% canned plain text produces a fully annotated uniform SOLE-ML output containing all prosodic features found in the initial enriched source.
Furthermore, additional automatically-derived prosodic feature annotation and speech synthesis related values are assigned, such as word-placement in sentences and phrases, previous and next word entity relations, emphatic phrases containing proper nouns, and more.
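
As a rough illustration of the kind of automatically derived annotation described above (word placement in sentences, previous and next word relations, XML output), the following minimal Python sketch may help. It is not the authors' implementation and does not reproduce the SOLE-ML schema: the element names (`text`, `sentence`, `word`), the attribute names (`position`, `count`, `prev`, `next`) and the naive regex-based tokenization are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Naive word tokenizer: runs of word characters (Unicode-aware, so Greek works)."""
    return re.findall(r"\w+", sentence)


def annotate(text):
    """Build an XML tree with per-word placement and neighbour attributes.

    Element and attribute names are hypothetical, not actual SOLE-ML.
    """
    root = ET.Element("text")
    for s_idx, sentence in enumerate(split_sentences(text), start=1):
        s_el = ET.SubElement(root, "sentence", id=str(s_idx))
        words = split_words(sentence)
        for w_idx, word in enumerate(words):
            w_el = ET.SubElement(s_el, "word")
            w_el.text = word
            # 1-based placement of the word within its sentence
            w_el.set("position", str(w_idx + 1))
            w_el.set("count", str(len(words)))
            # Neighbouring words; empty string at sentence boundaries
            w_el.set("prev", words[w_idx - 1] if w_idx > 0 else "")
            w_el.set("next", words[w_idx + 1] if w_idx + 1 < len(words) else "")
    return root


if __name__ == "__main__":
    tree = annotate("The vase is old. It was found in Athens!")
    print(ET.tostring(tree, encoding="unicode"))
```

A real pipeline of the sort the abstract describes would add part-of-speech tags and category labels such as “emphatic event” on top of this skeleton; those steps are omitted here because the abstract does not specify how they are computed.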

Similar resources

A new approach to the analysis and annotation of speech and prosody based on computerized cross-linguistic corpora

In the present paper, corpus linguistics becomes a valuable methodological tool for cross-linguistic research on speech and prosody. The inherent complexity of speech analysis and prosodic annotation increases when the object of study is a longitudinal computerized corpus of native and nonnative varieties of English. The lack of generally accepted prosodic transcription systems adds further dif...

From English pitch accent detection to Mandarin stress detection, where is the difference?

Although English pitch accent detection has been studied extensively, relatively few works explore Mandarin stress detection. Moreover, the comparison and analysis between Mandarin stress detection and English pitch accent detection have not been touched for such counterpart tasks. In this paper, we discuss Mandarin stress detection and compare it with English pitch accent detection. The c...

The Intonational Phonology of Catalan

This chapter presents an analysis of the prosodic and intonational structure of Catalan within the Autosegmental-Metrical (AM) framework (Pierrehumbert 1980, Pierrehumbert and Beckman 1988, Ladd 1996, Gussenhoven 2004, Jun 2005, and Beckman et al. 2005, among others). Based on this analysis, we have developed the Cat_ToBI system of prosodic annotation of Catalan corpora (Prieto, Aguilar, Mascar...

Prosodically Enriched Text Annotation for High Quality Speech Synthesis

Linguistically enriched text generated from natural language modules contributes significantly to the quality of speech synthesis. For all cases where such modules are not available, such enriched input needs to be produced from plain text in order to maintain quality. This work reports on a framework of several combined language resources and procedures (word/sentence identification, syntactic...

Design and Evaluation of Shared Prosodic Annotation for Spontaneous French Speech: From Expert Knowledge to Non-Expert Annotation

In the area of large French speech corpora, there is a demonstrated need for a common prosodic notation system allowing for easy data exchange, comparison, and automatic annotation. The major questions are: (1) how to develop a single simple scheme of prosodic transcription which could form the basis of guidelines for non-expert manual annotation (NEMA), used for linguistic teaching and researc...

Publication date: 2008